Policy-Adaptive Estimator Selection for Off-Policy Evaluation

Authors

Abstract

Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because an estimator's accuracy can vary greatly depending on the given OPE task, such as the evaluation policy, the number of actions, and the noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to the given OPE task, by appropriately subsampling the available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves estimator selection compared to a non-adaptive heuristic. Note that the complete version with the technical appendix is available on arXiv: http://arxiv.org/abs/2211.13904.
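To make the idea in the abstract concrete, below is a minimal Python sketch of adaptive estimator selection by subsampling. It assumes logged data stored as a dict of NumPy arrays with (at least) the keys "reward" and "pscore" (the logging policy's propensity for the logged action), discrete actions, and candidate estimators given as callables. The uniform pseudo evaluation policy, the function names, and the rejection-sampling split are illustrative assumptions, not the paper's exact construction, which instead fits the subsampling so that the pseudo task mimics the policy shift of the actual OPE task.

import numpy as np

def ips(data, eval_prob):
    # Inverse propensity scoring: reweight each logged reward by the ratio of
    # the target policy's propensity to the logging policy's propensity.
    return np.mean(eval_prob / data["pscore"] * data["reward"])

def snips(data, eval_prob):
    # Self-normalized IPS: divide by the sum of importance weights instead of n.
    w = eval_prob / data["pscore"]
    return np.sum(w * data["reward"]) / np.sum(w)

def select_estimator(logged_data, n_actions, candidate_estimators,
                     n_pseudo_tasks=20, seed=0):
    # Rank candidate estimators by their squared error on pseudo OPE tasks
    # built from the logged data alone (simplified sketch, not the paper's
    # exact procedure).
    rng = np.random.default_rng(seed)
    n = len(logged_data["reward"])
    errors = {name: [] for name in candidate_estimators}
    for _ in range(n_pseudo_tasks):
        # 1) Random split: one half keeps its logging-policy propensities and
        #    plays the role of the logged dataset in the pseudo task.
        idx = rng.permutation(n)
        log_idx, truth_idx = idx[: n // 2], idx[n // 2:]
        pseudo_logged = {k: v[log_idx] for k, v in logged_data.items()}
        # 2) Pseudo evaluation policy: uniform over actions (chosen here only
        #    for simplicity), so its propensity is 1 / n_actions everywhere.
        eval_prob = np.full(len(log_idx), 1.0 / n_actions)
        # 3) Rejection-sample the other half so that the kept records look as
        #    if their actions had been chosen by the pseudo evaluation policy;
        #    their mean reward then serves as a pseudo ground-truth value.
        ratio = (1.0 / n_actions) / logged_data["pscore"][truth_idx]
        keep = rng.random(len(truth_idx)) < ratio / ratio.max()
        if not keep.any():
            continue
        pseudo_truth = logged_data["reward"][truth_idx][keep].mean()
        # 4) Score every candidate estimator on this pseudo task.
        for name, estimator in candidate_estimators.items():
            estimate = estimator(pseudo_logged, eval_prob)
            errors[name].append((estimate - pseudo_truth) ** 2)
    # Return the name of the estimator with the smallest mean squared error.
    return min(errors, key=lambda name: np.mean(errors[name]))

For example, select_estimator(logged_data, n_actions=10, candidate_estimators={"ips": ips, "snips": snips}) would return the name of whichever estimator tracks the pseudo ground truth more closely on this particular dataset.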


Similar Articles

Eligibility Traces for Off-Policy Policy Evaluation

Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policie...


Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...


Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

We study the off-policy evaluation problem— estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) an...
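For context, the IPS estimator mentioned here has the standard textbook form (the notation below is assumed for illustration rather than taken from this page):

    V_IPS(pi) = (1/n) * sum_{i=1..n} [ pi(a_i | x_i) / mu(a_i | x_i) ] * r_i,

where (x_i, a_i, r_i) are the logged context, action, and reward, mu is the logging policy, and pi is the target policy being evaluated.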


High-Confidence Off-Policy Evaluation

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance ...


Off-policy evaluation for slate recommendation

This paper studies the evaluation of policies which recommend an ordered set of items based on some context—a common scenario in web search, ads, and recommender systems. We develop a novel technique to evaluate such policies offline using logged past data with negligible bias. Our method builds on the assumption that the observed quality of the entire recommended set additively decomposes acro...
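The additive decomposition referred to in this abstract is usually stated as

    E[ r | x, s ] = sum_{k=1..K} phi(x, k, s_k),

where s = (s_1, ..., s_K) is the recommended slate and phi is an unknown per-slot contribution function; the symbol phi and this exact notation follow the standard formulation of the assumption and are used here only for illustration.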



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i8.26195